Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition around subprocess inference state tidyup #2008

Merged
merged 4 commits into from
Jul 4, 2024

Conversation

PeterJCLaw
Copy link
Collaborator

Summary

There was a race condition due to the combination of Python's object ids being re-usable and Jedi persisting such ids beyond the real lifeteime of some objects. This could lead to the subprocess' view of the lifetime of InferenceState contexts getting out of step with that in the parent process and resulting in errors when removing them. It is also possible that this could result in erroneous results being reported, however this was not directly observed.

The race was specifically:

  • InferenceState A created, gets id 1
  • InferenceStateSubprocess A' created, uses InferenceState A which it stores as a weakref and an id
  • InferenceStateSubprocess A' is used, the sub-process learns about an InferenceState with id 1
  • InferenceState A goes away, InferenceStateSubprocess A' is not yet garbage collected
  • InferenceState B created, gets id 1
  • InferenceStateSubprocess B' created, uses InferenceState B which it stores as a weakref and an id
  • InferenceStateSubprocess B' is used, the sub-process re-uses its entry for an InferenceState with id 1

At this point the order of operations between the two InferenceStateSubprocess instances going away is immaterial -- both will trigger a removal of a state with id 1. As long as B' doesn't try to use the sub-process again after the first removal has happened then the second removal will fail.

This change resolves the race condition by coupling the context in the subprocess to the corresponding manager class instance in the parent process, rather than to the consumer InferenceState.

See inline comments for further details.

Code review

Suggest review by commit. There's some small refactors in the early commits. The actual fix is the last commit on the branch.

Testing

I'm open to ideas on how to add a test for this.

Currently this is relying on Python 3.13 tests reproducing the issue fairly easily, though not entirely consistently.

Links

Currently builds on #2003 (as it was discovered there), though it doesn't really need to be part of the upgrade process.

@PeterJCLaw PeterJCLaw changed the title Ensure unique subprocess reference ids Fix race condition around subprocess inference state tidyup Jul 1, 2024
Copy link
Owner

@davidhalter davidhalter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is insane. Thanks a lot for documenting all of these details. It's extremely rare to be gifted with pull requests of this quality! Feel free to merge and thanks again for dealing with all of my old code that's really not that easy to understand ❤️

@PeterJCLaw
Copy link
Collaborator Author

Thanks :D

For clarity: I'm keeping this on hold as I've started seeing test failures on my python3.13 branch having rebased on top of this branch (and the changes here now rebased on master). I'm not sure of the cause just yet as the failures are new and didn't show up in PeterJCLaw#3 despite that PR ought to having been functionally the same code. The only difference between the two I think is my linting PR #2006, which I hadn't expected to change anything.
I'll hopefully have some time this week to pin down the issue.

This removes some of the coupling between the management of the
underlying process and the inference state itself, which intends
to enable changing the origin of the id. This will be useful in
the next commit.
I'm not sure where this was used in the past, however it appears
to be unused now. Removing this simplifies a change I'm about to
make to _InferenceStateProcess.
This is derived from my understanding of the code, plus a bit of
experimentation.
There was a race condition due to the combination of Python's
object ids being re-usable and Jedi persisting such ids beyond
the real lifeteime of some objects. This could lead to the
subprocess' view of the lifetime of `InferenceState` contexts
getting out of step with that in the parent process and
resulting in errors when removing them. It is also possible
that this could result in erroneous results being reported,
however this was not directly observed.

The race was specifically:
- `InferenceState` A created, gets id 1
- `InferenceStateSubprocess` A' created, uses `InferenceState`
  A which it stores as a weakref and an id
- `InferenceStateSubprocess` A' is used, the sub-process learns
  about an `InferenceState` with id 1
- `InferenceState` A goes away, `InferenceStateSubprocess` A' is
  not yet garbage collected
- `InferenceState` B created, gets id 1
- `InferenceStateSubprocess` B' created, uses `InferenceState` B
  which it stores as a weakref and an id
- `InferenceStateSubprocess` B' is used, the sub-process re-uses
  its entry for an `InferenceState` with id 1

At this point the order of operations between the two
`InferenceStateSubprocess` instances going away is immaterial --
both will trigger a removal of a state with id 1. As long as B'
doesn't try to use the sub-process again after the first removal
has happened then the second removal will fail.

This commit resolves the race condition by coupling the context
in the subprocess to the corresponding manager class instance
in the parent process, rather than to the consumer `InferenceState`.

See inline comments for further details.
@PeterJCLaw PeterJCLaw force-pushed the ensure-unique-subprocess-reference-ids branch from a70bc50 to a67deeb Compare July 2, 2024 20:38
@PeterJCLaw
Copy link
Collaborator Author

For clarity: I'm keeping this on hold as I've started seeing test failures on my python3.13 branch having rebased on top of this branch (and the changes here now rebased on master). I'm not sure of the cause just yet as the failures are new and didn't show up in PeterJCLaw#3 despite that PR ought to having been functionally the same code.

I believe these are down to some changes in Python 3.13.0-beta.3 (where previously CI was using beta.2), and I'm happy that that's not related to the changes in this PR.

@PeterJCLaw PeterJCLaw marked this pull request as ready for review July 2, 2024 21:43
@davidhalter
Copy link
Owner

That makes sense. The pull request itself feels very simple and I am not surprised it's not your fault :)

@PeterJCLaw PeterJCLaw merged commit e839683 into master Jul 4, 2024
228 checks passed
@PeterJCLaw PeterJCLaw deleted the ensure-unique-subprocess-reference-ids branch July 4, 2024 21:40
@PeterJCLaw PeterJCLaw mentioned this pull request Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants